AN S 562A: Intro to Linear Models

Juan Steibel

Learning objectives

  • Review basic concepts of linear models

    • OLS estimates

    • Incidence matrices

    • Model selection

    • Outlier detection

  • Understand and Compare computational alterntives to obtain OLS and understand similarities and differences between OLS to MLE estimates

  • Apply R/Julia functions to fit and check linear fixed effects models

Basic Linear model

\[ \color{blue}{y_i} = \sum_{j=1}^{p} {\color{red}{x_{ij}} \color{darkgreen}{\beta_j}}+\color{orange}{e_i} \]

\[ E(\color{orange}{e_i}) = 0 \qquad(1)\]\(\color{blue}{y_i}\) observation of response variable,

\(\color{red}{x_{ij}}\) : observation of predictor variable, \(\color{darkgreen}{\beta_j}\) : linear coefficient \(\color{orange}{e_i}\) : residual value In matrix form: \[ \boldsymbol{\color{blue}y}=\boldsymbol{\color{red}X}\boldsymbol{\color{darkgreen}\beta}+\boldsymbol{\color{orange}e} \qquad(2)\]

\(rank(\boldsymbol{\color{red}X}) = p\) = matrix is full column rank. Class question.

Example 1: Longley data (only 6 obs)

     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
1952         98.1 346.999      193.2        359.4    113.270 1952   63.639

\[ Employed_i=\alpha+GNP_i \beta+ e_i, \boldsymbol{y}=\begin{bmatrix}60.323 \\61.122 \\60.171 \\61.187 \\63.221 \\63.639 \\\end{bmatrix}, \boldsymbol{X} =\begin{bmatrix}1&234.289 \\1&259.426 \\1&258.054 \\1&284.599 \\1&328.975 \\1&346.999 \\\end{bmatrix} \]

Model assumptions and chosing \(\beta\)

\(E(\boldsymbol{e}) = 0\), Makes no further distributional assumptions. Then: \(E(\boldsymbol{\color{blue}y})=\boldsymbol{\color{red}X}\boldsymbol{\color{darkgreen}\beta}\)

What is the best fitting line?

Model assumptions and chosing \(\beta\)

\(E(\boldsymbol{e}) = 0\), Makes no further distributional assumptions. Then: \(E(\boldsymbol{\color{blue}y})=\boldsymbol{\color{red}X}\boldsymbol{\color{darkgreen}\beta}\)

All these lines produce residuals with expectation 0:

We need another criteria to estimate \(\beta\)

Ordinary Least Squares

Set the parameters that minimize this: \(SSE=\sum_{i=1} ^n (\hat{e}_i^2) = \boldsymbol{e^{'}}\boldsymbol{e}=(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})^{'}(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})\)

This means we need to take the derivative of SSE with respect to \(\hat{\boldsymbol{\beta}}\) (not shown).

Ordinary Least Squares

Set the parameters that minimize this: \(SSE = \sum_{i=1} ^n (\hat{e}_i^2) = \boldsymbol{e^{'}}\boldsymbol{e}=(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})^{'}(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})\)

This is called ordinary the least squares estimate or OLS of \(\boldsymbol{\beta}\) And it computed as follows:

\[\hat{\boldsymbol{\beta}} =(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y} \] Thus, the OLS of observations can be written:

\[ \hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y} = \boldsymbol{P_{X}}\boldsymbol{y} \]

With \(\boldsymbol{P_X}\) = Projection matrix (projects response on column space of predictors in \(\boldsymbol{X}\))

This matrix is also called the Hat matrix and its diagonal elements are called “leverages”, which are used in outlier detection (see slide later).

Example 1:

\[ \boldsymbol{X^{'}} \boldsymbol{X} = \begin{bmatrix} 16& 6203 \\ 6203&2553152 \\\end{bmatrix}, \boldsymbol{X^{'}} \boldsymbol{X}^{-1} = \begin{bmatrix} 1.1e+00&-2.6e-03 \\-2.6e-03& 6.7e-06 \\\end{bmatrix} \] \[ \boldsymbol{X}^{'}\boldsymbol{y}=\begin{bmatrix} 1045 \\410323 \\\end{bmatrix}, \boldsymbol{\hat{\beta}}=\begin{bmatrix}51.844 \\ 0.035 \\\end{bmatrix} \] Predicted y is \(\boldsymbol{\hat{y}}=\boldsymbol{X}\boldsymbol{\hat{\beta}}\) (10 obs)

\[ \begin{bmatrix}60&60.9&60.8&61.7&63.3&63.9&64.5&64.5&65.7&66.4 \\\end{bmatrix} \] Observed y:

\[ \begin{bmatrix}60.3&61.1&60.2&61.2&63.2&63.6&65&63.8&66&67.9 \\\end{bmatrix} \]

Properties of the projection matrix


\(\boldsymbol{P_X}\) is an important matrix in mixed models. It’s the projection matrix and we will learn other projection matrices in this course.

  • Symmetric: \(\boldsymbol{P_X} = \boldsymbol{P_X^{'}}\)

  • Idempotent: \(\boldsymbol{P_X} = \boldsymbol{P_X} \boldsymbol{P_X}\)

  • \(\boldsymbol{P_X} \boldsymbol{X} = \boldsymbol{X}\)

  • \(rank(\boldsymbol{P_X}) = rank(\boldsymbol{X}) = tr(\boldsymbol{P_X}) = p\)

  • \(\boldsymbol{\hat{e}}=( \boldsymbol{I}-\boldsymbol{P_X})\boldsymbol{y}\)

  • \(( \boldsymbol{I}-\boldsymbol{P_X})( \boldsymbol{I}-\boldsymbol{P_X})=( \boldsymbol{I}-\boldsymbol{P_X})\)

Unbiased estimator of the variance

an unbiased estimate of the error variance, \(var(\boldsymbol{e})=\sigma^2_e\) is: \[ \hat{\sigma}_e^2=\frac{\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y}} {n-p} \] Also, note that: \[ SSE=\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y} \] The proof of this requires a bit of matrix algebra, you are welcome to try it and ask me questions.

Example 1:

\[ \hat{\sigma}_e^2= 0.43, \hat{\sigma}_e= 0.66, \boldsymbol{\hat{\beta}}=\begin{bmatrix}51.844 \\ 0.035 \\\end{bmatrix} \]


Call:
lm(formula = Employed ~ GNP, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7796 -0.5544 -0.0094  0.3436  1.4459 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 51.84359    0.68137    76.1  < 2e-16 ***
GNP          0.03475    0.00171    20.4  8.4e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.66 on 14 degrees of freedom
Multiple R-squared:  0.967, Adjusted R-squared:  0.965 
F-statistic:  415 on 1 and 14 DF,  p-value: 8.36e-12

Variance of (linear combinations of) estimates of linear parameters

The variance of \(\hat{\boldsymbol{\beta}}\) is: \[ var(\hat{\boldsymbol{\beta}})=(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2\\ var(\boldsymbol{K}\hat{\boldsymbol{\beta}})=\boldsymbol{K}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{K}^{'}{\sigma}_e^2 \] where \(\boldsymbol{K}\) is a contrast matrix.

Challenge for the class: select \(\boldsymbol{K}\) so that \(\boldsymbol{K}\hat{\boldsymbol{\beta}}\) is just the \(i^{th}\) element of \(\hat{\boldsymbol{\beta}}\)

Another challenge: in practice, we don’t know \(\sigma_e^2\) what do we do then?

Computation of OLS

First: take a look on the OLS equations:

\[\hat{\boldsymbol{\beta}} =(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y} \]

Note: \((\boldsymbol{X^{'}} \boldsymbol{X})\) is the key matrix and the computational requirements are:

  • To build it through the cross product: \(O(n^2p)\)

  • To invert it using a standard algoritm: \(O(p^3)\)

Note: these are upper bounds as more efficient algoritms can be used to take advantage of special cases (symmetry, sparseness, etc).

OLS and QR decomposition

\[ \boldsymbol{X}=\boldsymbol{Q} \boldsymbol{R}=\begin{bmatrix} \boldsymbol{Q}_1 & \boldsymbol{Q}_2 \end{bmatrix}\begin{bmatrix} \boldsymbol{R}_1\\ \boldsymbol{0} \end{bmatrix} \] Where \(\boldsymbol{Q}\) and \(\boldsymbol{Q}_1\) are orthogonal matrices and \(\boldsymbol{Q}^{-1}=\boldsymbol{Q}^{'}\) and \(\boldsymbol{R}_1\) is an upper triangular matrix.

We can now express \(\boldsymbol{X}\) as a function of these matrices: \[ \boldsymbol{X}=\boldsymbol{Q}_1\boldsymbol{R}_1+\boldsymbol{Q}_2\boldsymbol{0}=\boldsymbol{Q}_1\boldsymbol{R}_1 \] Now we can use this expression for obtaining OLS

OLS and QR decomposition

\[ \boldsymbol{X}^{'}\boldsymbol{X}=(\boldsymbol{Q}_1 \boldsymbol{R}_1)^{'}\boldsymbol{Q}_1 \boldsymbol{R}_1=\boldsymbol{R}_1^{'}\boldsymbol{Q}_1^{'}\boldsymbol{Q}_1\boldsymbol{R}_1=\boldsymbol{R}_1^{'}\boldsymbol{R}_1 \] \[ (\boldsymbol{X}^{'}\boldsymbol{X})^{-1}= (\boldsymbol{R}_1^{'}\boldsymbol{R}_1)^{-1}=(\boldsymbol{R}_1)^{-1} (\boldsymbol{R}^{'}_1)^{-1} \] \[ \boldsymbol{X}^{'}\boldsymbol{y}=(\boldsymbol{Q}_1 \boldsymbol{R}_1)^{'}=\boldsymbol{R}_1^{'}\boldsymbol{Q}_1^{'} \] \[ (\boldsymbol{X}^{'}\boldsymbol{X})^{-1}\boldsymbol{X}^{'}\boldsymbol{y}=\boldsymbol{R}_1^{-1} (\boldsymbol{R}_1^{'})^{-1}\boldsymbol{R_1}^{'}\boldsymbol{Q}_1^{'}\boldsymbol{y}=\boldsymbol{R}_1^{-1}\boldsymbol{Q}_1^{'}\boldsymbol{y} \] There are efficient algorithms to invert triangular matrices. Then, depending on the structure of \(\boldsymbol{X}\) and of\(\boldsymbol{X}^{'}\boldsymbol{X}\), direct inverse of desomposition may be computationally more convenient.

Prediction of the mean

Suppose we want to predict the mean response for a specific value of the predictor variables represented in row-vector \(\boldsymbol{x}_{new}\): \[ \bar{\boldsymbol{y}}_{new}=\boldsymbol{x}_{new}\hat{\boldsymbol{\beta}} \] We have the following properties: \[ \bar{E(\boldsymbol{y}_{new})}=\boldsymbol{X}_{new}{\boldsymbol{\beta}} \] and \[ var(\bar{\boldsymbol{y}}_{new})= \boldsymbol{x}_{new}var(\hat{\boldsymbol{\beta}})\boldsymbol{x}_{new}^{'}= \boldsymbol{x}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{x}_{new}^{'}{\sigma}_e^2 \]

Prediction of future observations

Similarly, if we want to predict the future observations for a specific value of the predictor variables represented in row-vector \(\boldsymbol{x}_{new}\): \[ \hat{\boldsymbol{y}}_{new}=\boldsymbol{x}_{new}\hat{\boldsymbol{\beta}} \] We have the following properties: \[ \hat{E(\boldsymbol{y}_{new})}=\boldsymbol{X}_{new}{\boldsymbol{\beta}} \] and \[ var(\hat{\boldsymbol{y}}_{new})= \boldsymbol{x}_{new}var(\hat{\boldsymbol{\beta}})\boldsymbol{x}_{new}^{'}+var(\boldsymbol{e_i})= \boldsymbol{x}_{new}var(\hat{\boldsymbol{\beta}})\boldsymbol{x}_{new}^{'}+{\sigma}_e^2 \] \[ = (1+\boldsymbol{x}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{x}_{new}^{'}){\sigma}_e^2 \]

Compare this expression to the variance of the expected value: \[ var(\bar{\boldsymbol{y}}_{new})= \boldsymbol{x}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{x}_{new}^{'}{\sigma}_e^2 \]

Example 2: Incidence matrix for classification factors

Tool Growth. Teeth length of guinea pigs with vitamin supplementations: 2 formulations x 3 doses. 10 replicates per group.

   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

Let`s assume that supplement type and dose are classification factors and build incidence matrices for the 2x3 factorial design

Example 2: not full rank

Incidence matrix for each level combination of both factors \[ \begin{bmatrix}VC&0.5 \\VC&1 \\VC&2 \\OJ&0.5 \\OJ&1 \\OJ&2 \\\end{bmatrix}, \begin{bmatrix}1&1&0&0&0&1&0&1&0&0&0&0 \\1&0&1&0&0&1&0&0&0&1&0&0 \\1&0&0&1&0&1&0&0&0&0&0&1 \\1&1&0&0&1&0&1&0&0&0&0&0 \\1&0&1&0&1&0&0&0&1&0&0&0 \\1&0&0&1&1&0&0&0&0&0&1&0 \\\end{bmatrix} \] This matrix is not full rank.
Challenge 1: What columns correspond to each factor?
Challenge 2: What is the rank of this matrix?

Example 2: full rank via corner parameterization

Let’s use the corner parameterization to obtain a full rank matrix

\[ \begin{bmatrix}VC&0.5 \\VC&1 \\VC&2 \\OJ&0.5 \\OJ&1 \\OJ&2 \\\end{bmatrix}, \begin{bmatrix}1&1&0&0&0&0 \\1&1&1&0&1&0 \\1&1&0&1&0&1 \\1&0&0&0&0&0 \\1&0&1&0&0&0 \\1&0&0&1&0&0 \\\end{bmatrix} \] Challenge: Explain how the matrix was forced into full rank

Example 2 alternative full rank

Another alternative is the cell-mean model parametrization: \[ \begin{bmatrix}VC&0.5 \\VC&1 \\VC&2 \\OJ&0.5 \\OJ&1 \\OJ&2 \\\end{bmatrix}, \begin{bmatrix}0&1&0&0&0&0 \\0&0&0&1&0&0 \\0&0&0&0&0&1 \\1&0&0&0&0&0 \\0&0&1&0&0&0 \\0&0&0&0&1&0 \\\end{bmatrix} \]

Example 2: OLS equations for corner parameterization

\[ (\boldsymbol{X^{'}X})=\begin{bmatrix}60&30&20&20&10&10 \\30&30&10&10&10&10 \\20&10&20&0&10&0 \\20&10&0&20&0&10 \\10&10&10&0&10&0 \\10&10&0&10&0&10 \\\end{bmatrix} \]

Example 2: OLS equations for cell mean model

\[ (\boldsymbol{X^{'}X})=\begin{bmatrix}10&0&0&0&0&0 \\0&10&0&0&0&0 \\0&0&10&0&0&0 \\0&0&0&10&0&0 \\0&0&0&0&10&0 \\0&0&0&0&0&10 \\\end{bmatrix} \] Challenge: Which one of the two are easier to invert?

Assuming normality

So far we did not make any parametric assumption. But we can assume that the residuals are Gaussianly independent and identically distributed:

\[ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{e}\\ \boldsymbol{e} \sim N(\boldsymbol{0},\boldsymbol{I}{\sigma}_e^2) \] Under these assumptions, we can obtain the Maximum Likelihood Estimate of the parameters by finding the maximum of the log-likelihood (which is the probability density function of the data seen as a function of the parameters)

Estimates of parameters under the normal linear model

\[ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{e}, \boldsymbol{e} \sim N(\boldsymbol{0},\boldsymbol{I}{\sigma}_e^2), \] \[ \hat{\boldsymbol{\beta}} =(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y}, \] \[ \hat{\boldsymbol{\beta}} \sim N({\boldsymbol{\beta}},(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2) \] \[ \hat{\sigma}_e^2=\frac{\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y}} {n-p} \] These estimates, variances and expectations coincide with OLS for this model

Inference with unknown variances

\[ \hat{\boldsymbol{\beta}} \sim N({\boldsymbol{\beta}},(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2) \] For a single coefficient: \[ \hat{\beta_j} \sim N(\beta_j,d_j \sigma_e^2), \]

where \(d_j\) is the \(i^{th}\) diagonal element of \((\boldsymbol{X^{'}X})^{-1}\). With unknown residual variance: \[ \hat{\beta_j} \sim t({\beta_j},d_j\hat{\sigma}_e^2,n-p), \]

Use to obtain confidence intervals and hypothesis tests for regression coefficients

Estimation of expected values under the Gaussian GLM

Given a new matrix \(\boldsymbol{X}_{new}\), the expected value is: \[ \hat{E(\boldsymbol{y}_{new})}=\bar{\boldsymbol{y}}_{new}=\boldsymbol{X}_{new}\hat{\boldsymbol{\beta}} \]

\[ \bar{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, \boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}{\sigma}_e^2) \] This result is used to obtain the confidence band of a regression. And similar to the previous slide, the estimated variance can be replaced and the distribution of a single estimated coefficient will be the student-t distribution.

Prediction of future observations under the Gaussian Linear Model

\[ \hat{\boldsymbol{y}}_{new}=\boldsymbol{X}_{new}\hat{\boldsymbol{\beta}} \]

\[ \hat{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, (\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}+\boldsymbol{I}){\sigma}_e^2) \] This result is used to obtain the prediction band of a regression.

Estimating the mean vs predicting future observations

\[ \bar{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, \boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}{\sigma}_e^2) \] versus \[ \hat{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, (\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}+\boldsymbol{I}){\sigma}_e^2) \] Class challenge: make at least two observations about these two expressions

Hypothesis testing

Our goal is not to learn about hypothesis testing for the GLM or for the GLMM. However, we may have to test hypotheses. Here we review BASIC properties regarding hypothesis testing.

for testing individual coefficients: \[ \hat{\beta_j} \sim t({\beta_j},(d_j\hat{\sigma}_e^2,n-p), \] where \(d_j\) is the \(i^{th}\) diagonal element of \(\boldsymbol{X^{'}X})^{-1}\).
for testing several coefficients at the same time, under the null: \[ \boldsymbol{K}\hat{\boldsymbol{\beta}}\boldsymbol{K}^{'}\sim F(rank(\boldsymbol{K}),n-p) \] Matrix \(\boldsymbol{K}\) is built to represent the corresponding columns of \(\boldsymbol{X}\)

Hypothesis testing: Sequential vs Marginal tests

Revisit Example 2: \(2 \times 3\) factorial with: Supplement type:\(Sup\), dose: \(Dose\) and they interaction: “SxD”.

There are two main ways in which we can proceed in this case to test these factors:

Sequential test (type I)
1: \(F(Sup|1)\)
2: \(F(Dose|Sup+1)\)
3: \(F(D \times S|Dose+Sup+1)\)

Marginal test (type III)
1:\(F(Sup|Dose+D \times S+1)\)
2:\(F(Dose|Sup+D \times S+1)\)
3:\(F(D \times S) Dose+Sup+1)\)

The main difference between these approches is that the order of testing matters in type I tests. Thus, this is only recommended when the order for columns of \(\boldsymbol X\) is pre-specified.\

A dissadvantage of type III tests is that none of the variables may test as significantly associated to the response variable due to colinearity, thus this may not be a good way of selecting variables either.

Model selection: statistical approach

It may be very important to all downstream analysis the selection an appropriate model (i.e: relevant columns in \(\boldsymbol{X}\)).

The most common way to do this is to fit several competing models and select the “best one” according to a certain criteria.

The \(SSE\) of a model is a measure of the variance in the response that is lest unexplained by the model. Contrarily, \(R^2\) is the proportion of variance that is not explained by the model: \[ R^2 = 1-{SSE \over SSTotal} \]

Challente: Are \(SSE\) or \(R^2\)a good criteria for model selection?

Model selection: statistical approach

The most common statistical approach to model selection in the GLM is to use a function of \(SSE\) or \(R^2\)penalized by model size.

Mallow’s CP \[ C_p={{SSE_p}\over{MSE_{full}}}-n+2(p+1) \] (smaller is better)

Adjusted \(R^2\) \[ 1-(1-\boldsymbol{R}^2){{n-1}\over{n-p}} \] (larger is better)

These are only two of many model selection criteria. These two model criteria don’t require any distributional assumtion beyond what OLS requires. Other criteria such as BIC, AIC, DIC, etc, rely on specific distributional assumptions and are commonly used with likelihood methods.

Model selection: machine learning approach

Instead of penalizing \(SSE\) or \(R^2\), the machine learning approach separates data intro training and testing. Models is fit in the training dataset and the model (lack-of-)fit are computed on the testing set.

Challenge: list advantages and disadvantages of the proposed approach.

Outlier detection: statistical approach

Outliers are values associated with large residuals. The simplest form of outlier detection is a graphic of residuals vs observed values:

However the scale of these residuals is hard to judge. So it’s better to produce standardized residuals: \[ t_i={\hat{e}_i\over \sqrt{var(\hat{e}_i)}} \]

Outlier detection statistical approach

The variance of a residual is a function of the error variance and the diagonal element of the projection matrix \(P_X\), also called “hat” matrix: \[ var(\hat{e}_i)=\sigma_e^2 (1-h_{ii}) \] where \(h_{ii}\) is the ith diagonal element of \(P_X\). If instead of the population variance, an estimate is replaced into the equation, then the residual is called “studentized” instead of standardized.

Standardized or Studentized follow a Gaussian or student-t distribution: it’s easier to judge large values of the residual.

Machine learning approach to outlier detection

An approach common to both machine learning and classic analysis consists of computing the standardized residuals for each observation, but using a model fit under leave-one-out cross validation.

This means that the observation in question is dropped from the training set one at a time.

In classic analysis this was often called “external residual approach” and in machine learning it’s called leave-one-out residual analysis. The names are used today exchangeably by everyone.

Another procedure commonly used in machine learning is to use a robut model, for instance: quantile regression to compute residuals and their prediction bands.

Quantile regression is outside the scope of this course. But I included an example in the following slides.

Example: outlier detection with robust regression.

Problem: with influential outliers, OLS estimate change and some points do not look like outliers anymore

Example: outlier detection with robust regression.

Using Robust regression helps with this problem: notice that the line barely changes when outliers are introduced.

Example: outlier detection with robust regression.

But, of course, that depends on the % of outliers.
Look at the same experiment when more days are added and the feeder was not recalibrated:

Example: Animal 1361!

Example: outlier detection with robust regression.

With well behaved data everything works… but then, we may not need outlier detection then :)